Skip to content

[SPARK-56295][PYTHON] Add Java error classes to Python side#55100

Open
gaogaotiantian wants to merge 3 commits intoapache:masterfrom
gaogaotiantian:load-java-error-classes
Open

[SPARK-56295][PYTHON] Add Java error classes to Python side#55100
gaogaotiantian wants to merge 3 commits intoapache:masterfrom
gaogaotiantian:load-java-error-classes

Conversation

@gaogaotiantian
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Include Java side error classes in Python so python can generate consistent error messages as Java with the same error class + message parameters.

Why are the changes needed?

Currently we have two separate error-conditions.json files, for Python and Java. I think it's historical reason that we can't easily load Java's json on Python so we created a Python specific one. The issue is that two files are unsynced. It's possible that we want to raise the same error on both Python and Java, and we either need to create a new error class on Python json file or forget to do that and break message generation.

Ideally we should have a single source of truth, which is achievable now by merging every existing Python error classes into Java one. But there could be unexpected consequences and it's difficult to split them back once it's merged. Moreover, the Python file contains some python-specific error classes which may not fit into the java file.

As a compromise, we keep the Python file and load the Java file from Python. We keep the current structure and make it more flexible. In the future, if we have a clear plan of which way we should go, we can always eliminate the Python json file.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

A new test is added to test Python can raise Java-specific error class.

Was this patch authored or co-authored using generative AI tooling?

No.

ERROR_CLASSES_MAP = json.loads(ERROR_CLASSES_JSON)


def get_error_classes() -> dict[str, dict]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this impact python Spark Connect client-only installations? They would not have the jars installed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. So for now, it won't generate any error. The connect client would miss the java side error classes. As long as connect client only uses python error classes it should be fine.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can add a simple validation for that. Otherwise, the error might fail with a weird error, e.g., if error code in JVM is mistakenly used in Spark Connect code.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Actually I wonder if we just add a linter to sync JSON and Python error classes ....

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add any linter that we want, but we need to first figure out how we should organize our error classes.

We have two questions:

  • how many "real" JSON files do we want (source of truth)
  • how many copies do we want (to make things like packaging easier)

We can have linter to enforce our ideas so we don't have to worry about how to make it work at this point, just what we want.

For this current PR, we have 2 source of truth (JAVA + Python) and no copy. Python has Python + JAVA and JAVA has JAVA only. The good thing about this is:

  • Python always has all the error classes
  • No sync needed
  • Python can have its dedicated error classes (client error for example)

We can have a single error class file put in JAVA source code. We probably have to copy it to python because we need to have that accessible to connect clients (we can actually read it from classic though).

The question is - we probably do have error classes that are specific to Python code, putting those in the JAVA json file is a bit weird to me. That's why I chose this structure.

This indeed introduced the problem of connect client not able to get the JAVA error class - which is actually our current behavior (before this PR). Python does not understand JAVA error class at all before this PR and we are fine. If we want the client to have all the error classes, we can copy the JAVA side json to python and enforce the sync checking in CI.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I prefer is just to make it easier to maintain. e.g., I got some questions about this error class files time to time still, and why are they different.

Another approach might be just add a check if the error class exists in Exception class itself that is only enabled when testing (e.g., SPARK_TESTING environment variable with an assert).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are they different

I think I mentioned this in the long comment I wrote above :) Basically here will be some error classes that are Python specific (like connect client errors), and it would be weird to put it to Java source code. Java also doesn't need to know this.

We can have all kind of validations for this, for now we only need to decide how to organize these error classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants